Introduction to Triton Programming: From Eager-Mode Operations to Block-Based Parallel Computing

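Moving from PyTorch eager mode to Triton requires a shift in perspective: instead of treating a tensor as one monolithic object, we treat it as a collection of discrete, manageable blocks (tiles).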

1. PyTorch Tensors vs. Triton Tensors

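It is essential to distinguish Triton tensors from PyTorch tensors. A PyTorch tensor is a host-side Python object that wraps shape, dtype, device, stride, and storage metadata. Triton, by contrast, works with raw data pointers into specific blocks of memory, which allows lower-level optimization.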
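The distinction can be made concrete with a small plain-Python model. This is only a sketch: `TensorMeta` and `load_block` are hypothetical names invented for illustration, and a Python list stands in for device memory. Neither is part of PyTorch or Triton.

```python
from dataclasses import dataclass


@dataclass
class TensorMeta:
    """Toy stand-in for the metadata a host-side tensor object carries."""
    shape: tuple
    strides: tuple   # measured in elements, not bytes
    storage: list    # flat buffer standing in for device memory
    offset: int = 0  # where this tensor's data starts inside storage

    def flat_index(self, *idx):
        # Combine logical indices with strides to locate the element in storage.
        return self.offset + sum(i * s for i, s in zip(idx, self.strides))

    def __getitem__(self, idx):
        return self.storage[self.flat_index(*idx)]


def load_block(storage, base, offsets):
    """Kernel-side view: a base 'pointer' plus explicit offsets, no shape or strides."""
    return [storage[base + o] for o in offsets]


storage = list(range(12))
a = TensorMeta(shape=(3, 4), strides=(4, 1), storage=storage)
a_t = TensorMeta(shape=(4, 3), strides=(1, 4), storage=storage)  # transposed view, same storage

# The host-side object resolves indices through its metadata...
print(a[1, 2], a_t[2, 1])
# ...while the kernel-side view only sees a base address and a block of offsets.
row1 = load_block(storage, a.flat_index(1, 0), range(4))
print(row1)
```

Both views read the same flat buffer; only the metadata differs. Inside a kernel that metadata is gone, so any strides the kernel needs have to be passed in explicitly as arguments.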

2. The Eager-Mode Bottleneck

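In standard eager execution, every operation (for example, an addition followed by a ReLU) requires its own kernel launch and its own round trip to global memory. This is a major bottleneck in modern GPU computing. Triton overcomes it by fusing multiple operations into a single kernel that processes blocks of data (for example, 128, 256, or 512 elements) directly in on-chip memory.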
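To see why fusion helps, the following plain-Python sketch counts element transfers to and from "global memory" for an add followed by a ReLU. It is a simplified cost model, not Triton code: the function names are hypothetical, Python lists stand in for device buffers, and kernel-launch overhead is ignored.

```python
def eager_add_relu(x, y):
    """Two separate passes, as in eager mode. Returns (result, element transfers)."""
    traffic = 0
    tmp = [a + b for a, b in zip(x, y)]  # "kernel" 1: read x and y, write tmp
    traffic += len(x) + len(y) + len(tmp)
    out = [max(t, 0.0) for t in tmp]     # "kernel" 2: read tmp back, write out
    traffic += len(tmp) + len(out)
    return out, traffic


def fused_add_relu(x, y, block_size=4):
    """One pass over blocks; the intermediate sum never goes back to global memory."""
    traffic = 0
    out = []
    for start in range(0, len(x), block_size):
        xb = x[start:start + block_size]
        yb = y[start:start + block_size]
        traffic += len(xb) + len(yb)                       # read one block of each input
        block = [max(a + b, 0.0) for a, b in zip(xb, yb)]  # add and ReLU "on chip"
        out.extend(block)
        traffic += len(block)                              # write the result block once
    return out, traffic


x = [1.0, -2.0, 3.0, -4.0, 5.0, -6.0]
y = [0.5, 0.5, -5.0, 6.0, -1.0, 1.0]
eager_out, eager_traffic = eager_add_relu(x, y)
fused_out, fused_traffic = fused_add_relu(x, y)
print(eager_traffic, fused_traffic)
```

Under this model the eager version moves 5N elements and the fused version moves 3N, with identical results. The saving grows with every additional elementwise operation folded into the same kernel.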

3. The Block-Based Paradigm

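Unlike the scalar, per-thread mindset of CUDA, Triton applies SPMD (Single Program, Multiple Data) at the block level. You write a single kernel, and Triton launches many instances of it across a grid. Each instance uses its program_id to compute the "block" of memory it owns.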
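The index arithmetic each instance performs can be sketched in plain Python. The helper names below are hypothetical; they loosely mirror what a kernel would do with `tl.program_id`, `tl.arange`, and a load/store mask, with `cdiv` playing the role of the ceiling division used to size the grid. The sizes (N = 10, BLOCK_SIZE = 4) are chosen only to keep the output short.

```python
def cdiv(n, d):
    """Ceiling division: how many blocks of size d are needed to cover n elements."""
    return (n + d - 1) // d


def block_range(pid, n, block_size):
    """Half-open index range [start, end) owned by program instance `pid`."""
    start = pid * block_size
    return start, min(start + block_size, n)


def block_offsets_and_mask(pid, n, block_size):
    """Per-lane offsets for one block, plus a mask that disables out-of-range lanes."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [o < n for o in offsets]
    return offsets, mask


N, BLOCK_SIZE = 10, 4
grid = cdiv(N, BLOCK_SIZE)  # number of program instances to launch
ranges = [block_range(pid, N, BLOCK_SIZE) for pid in range(grid)]
last_offsets, last_mask = block_offsets_and_mask(grid - 1, N, BLOCK_SIZE)

for pid, (start, end) in enumerate(ranges):
    print(f"pid {pid}: [{start}, {end})")
print(last_offsets, last_mask)
```

The last block is only partially full, which is why the mask matters: every instance computes a full BLOCK_SIZE worth of offsets, and the mask keeps the trailing lanes from touching memory past the end of the tensor.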

4. Environment Setup

Before you begin, install Triton in a clean environment (Conda or venv) so that it does not run into dependency conflicts with an existing CUDA toolkit: pip install triton.
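After installing, one quick way to confirm what the active environment actually contains is to query package metadata from Python. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and checking `torch` alongside `triton` is an assumption about what you will want installed together.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("triton", "torch"):
    version = installed_version(name)
    print(f"{name}: {version if version is not None else 'not installed'}")
```

Run it with the interpreter of the environment you just created; a "not installed" line usually means the package went into a different environment than the one that is active.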

QUESTION 1

What is the primary difference between a PyTorch tensor and a Triton tensor within a kernel?

Triton tensors contain Python metadata like strides; PyTorch tensors are raw pointers.

A PyTorch tensor is a host-side object wrapping metadata; a Triton tensor represents blocks of data processed at the compiler level.

There is no difference; they are the same object.

Triton tensors are stored on the CPU, while PyTorch tensors are on the GPU.

QUESTION 2

Why is 'Eager Mode' considered a bottleneck for modern GPU performance?

Because it uses too much CPU memory.

Every operation requires a separate kernel launch and a global memory round-trip.

It cannot handle floating-point numbers.

It lacks support for the Python language.

QUESTION 3

What is the result of installing Triton in a 'dirty' environment with conflicting CUDA toolkits?

Triton will automatically fix the CUDA path.

It may lead to library version mismatches and kernel compilation errors.

The GPU will run faster due to multiple toolkit options.

Triton does not use CUDA, so there is no conflict.

QUESTION 4

Which mapping from pid to index range is correct for N=1000, BLOCK_SIZE=256?

pid 0: [0, 256); pid 1: [256, 512); pid 2: [512, 768); pid 3: [768, 1000)

pid 0: [0, 1000)

pid 0: [0, 256); pid 1: [257, 512); pid 2: [513, 768); pid 3: [769, 1000)

pid 1: [0, 256); pid 2: [256, 512); pid 3: [512, 768); pid 4: [768, 1000)

QUESTION 5

In block-based parallelism, the unit of work shifts from 'compute one element' to:

'Compute one entire tensor'.

'Compute one block of 128/256/512 elements'.

'Compute one scalar at a time'.

'Let the CPU handle the math'.